<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Transactional guarantees</title>
    <link rel="stylesheet" href="gettingStarted.css" type="text/css" />
    <meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
    <link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
    <link rel="up" href="rep.html" title="Chapter 13.  Berkeley DB Replication" />
    <link rel="prev" href="rep_bulk.html" title="Bulk transfer" />
    <link rel="next" href="rep_lease.html" title="Master leases" />
  </head>
  <body>
    <div xmlns="" class="navheader">
      <div class="libver">
        <p>Library Version 12.1.6.2</p>
      </div>
      <table width="100%" summary="Navigation header">
        <tr>
          <th colspan="3" align="center">Transactional guarantees</th>
        </tr>
        <tr>
          <td width="20%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
          <th width="60%" align="center">Chapter 13.  Berkeley DB Replication </th>
          <td width="20%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
        </tr>
      </table>
      <hr />
    </div>
    <div class="sect1" lang="en" xml:lang="en">
      <div class="titlepage">
        <div>
          <div>
            <h2 class="title" style="clear: both"><a id="rep_trans"></a>Transactional guarantees</h2>
          </div>
        </div>
      </div>
      <p>
        It is important to consider replication in the context of
        the overall database environment's transactional guarantees.
        To briefly review, transactional guarantees in a
        non-replicated application are based on the writing of log
        file records to "stable storage", usually a disk drive. If the
        application or system then fails, the Berkeley DB logging
        information is reviewed during recovery, and the databases are
        updated so that all changes made as part of committed
        transactions appear, and all changes made as part of
        uncommitted transactions do not appear. In this case, no
        information will have been lost.
    </p>
      <p>
        If a database environment does not require that the log be
        flushed to stable storage on transaction commit (using the
        <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag to increase performance at the cost of
        sacrificing transactional durability), Berkeley DB recovery
        will only be able to restore the system to the state of the
        last commit found on stable storage. In this case, information
        may have been lost (for example, the changes made by some
        committed transactions may not appear in the databases after
        recovery).
    </p>
      <p>
        Further, if there is database or log file loss or corruption
        (for example, if a disk drive fails), then catastrophic
        recovery is necessary, and Berkeley DB recovery will only be
        able to restore the system to the state of the last archived
        log file. In this case, information may also have been
        lost.
    </p>
      <p>
        Replicating the database environment extends this model by
        adding a new component to "stable storage": the client's
        replicated information. If a database environment is
        replicated, there is no lost information in the case of
        database or log file loss, because the replicated system can
        be configured to contain a complete set of databases and log
        records up to the point of failure. A database environment
        that loses a disk drive can have the drive replaced, and it
        can then rejoin the replication group.
    </p>
      <p>
        Because of this new component of stable storage, specifying
        <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> in a replicated environment no longer
        sacrifices durability, as long as one or more clients have
        acknowledged receipt of the messages sent by the master. Since
        network connections are often faster than local synchronous
        disk writes, replication becomes a way for applications to
        significantly improve their performance as well as their
        reliability.
    </p>
      <p>
        Applications using Replication Manager are free to use
        <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> at the master and/or clients as they see fit.
        The behavior of the <span class="bold"><strong>send</strong></span>
        function that Replication Manager provides on the
        application's behalf is determined by an "acknowledgement
        policy", which is configured by the <a href="../api_reference/C/repmgrset_ack_policy.html" class="olink">DB_ENV-&gt;repmgr_set_ack_policy()</a> method.
        Clients always send acknowledgements for <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>
        messages (unless the acknowledgement policy in effect
        indicates that the master doesn't care about them). For a
        <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> message, the master blocks the sending
        thread until either it receives the proper number of
        acknowledgements, or the <a href="../api_reference/C/repset_timeout.html#set_timeout_DB_REP_ACK_TIMEOUT" class="olink">DB_REP_ACK_TIMEOUT</a> expires. In the
        case of timeout, Replication Manager returns an error code
        from the <span class="bold"><strong>send</strong></span> function,
        causing Berkeley DB to flush the transaction log before
        returning to the application, as previously described. The
        default acknowledgement policy is <a href="../api_reference/C/repmgrset_ack_policy.html#ackspolicy_DB_REPMGR_ACKS_QUORUM" class="olink">DB_REPMGR_ACKS_QUORUM</a>,
        which ensures that the effect of a permanent record remains
        durable following an election.
    </p>
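      <p>
        The blocking behavior described above can be pictured with a
        small decision function. This is only a toy model, not
        Replication Manager's implementation: the names
        <code>wait_outcome</code>, <code>check_permanent_send</code> and
        <code>must_flush_log</code> are hypothetical, the required
        acknowledgement count stands in for whatever the configured
        acknowledgement policy demands, and the microsecond values
        stand in for the <a href="../api_reference/C/repset_timeout.html#set_timeout_DB_REP_ACK_TIMEOUT" class="olink">DB_REP_ACK_TIMEOUT</a> setting.
      </p>
      <pre class="programlisting"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch; not Berkeley DB constants. */
typedef enum {
    WAIT_CONTINUE,   /* keep blocking the sending thread */
    WAIT_ACKED,      /* enough acknowledgements arrived: send succeeds */
    WAIT_TIMED_OUT   /* ack timeout expired: send reports an error */
} wait_outcome;

/*
 * Decide what a thread blocked on a permanent message should do next.
 * acks_required models the demand of the acknowledgement policy (0 when
 * the master does not care about acknowledgements).
 */
wait_outcome
check_permanent_send(unsigned acks_received, unsigned acks_required,
    unsigned long waited_usec, unsigned long ack_timeout_usec)
{
    if (acks_received >= acks_required)
        return WAIT_ACKED;
    if (waited_usec >= ack_timeout_usec)
        return WAIT_TIMED_OUT;
    return WAIT_CONTINUE;
}

/* In this model, only a timed-out send leads to a local log flush. */
bool
must_flush_log(wait_outcome outcome)
{
    return outcome == WAIT_TIMED_OUT;
}
```
]]></pre>
      <p>
        In this model, a caller would re-evaluate
        <code>check_permanent_send</code> each time an acknowledgement
        arrives or a timer fires, and stop as soon as the result is no
        longer <code>WAIT_CONTINUE</code>.
      </p>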
      <p>
        The rest of this section discusses transactional guarantee
        considerations in Base API applications.
    </p>
      <p>
        The application sets the return status of its <span class="bold"><strong>send</strong></span> function
        according to the transactional guarantees it wants to
        provide. Whenever the <span class="bold"><strong>send</strong></span> function returns failure, the
        local database environment's log is flushed as necessary to
        ensure that any information critical to database integrity is
        not lost. Because this flush is an expensive operation in
        terms of database performance, applications should avoid
        returning an error from the <span class="bold"><strong>send</strong></span> function
        if at all possible.
    </p>
      <p>
        For replication transactional guarantees, the only messages
        that matter are those for which the application's <span class="bold"><strong>send</strong></span> function is called with the
        <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag specified. There is no reason for the
        <span class="bold"><strong>send</strong></span> function to ever
        return failure unless the <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag was
        specified. Messages without the <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag do
        not make visible changes to databases, so the <span class="bold"><strong>send</strong></span> function can return success to
        Berkeley DB as soon as the message has been sent to the
        client(s) or even just copied to local application memory in
        preparation for being sent.
    </p>
      <p>
        When a client receives a <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> message, the
        client will flush its log to stable storage before returning
        (unless the client environment has been configured with the
        <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> option). If the client is unable to flush a
        complete transactional record to disk for any reason (for
        example, there is a missing log record before the flagged
        message), the call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV-&gt;rep_process_message()</a> method on the client
        will return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> and return the LSN of this record
        to the application in the <span class="bold"><strong>ret_lsnp</strong></span> parameter. 
        The application's client or master message handling loops should take proper 
        action to ensure the correct transactional guarantees in this case. When
        missing records arrive and allow subsequent processing of
        previously stored permanent records, the call to the
        <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV-&gt;rep_process_message()</a> method on the client will return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a>
        and return the largest LSN of the permanent records that were
        flushed to disk. Client applications can use these LSNs to
        know definitively if any particular LSN is permanently stored
        or not.
    </p>
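      <p>
        As a minimal sketch of how a client might use the returned
        LSNs, the following code keeps the largest LSN reported as
        permanent and answers whether a given LSN is covered by it.
        Everything here is illustrative: <code>sketch_lsn</code> is a
        stand-in for the library's LSN type (assumed to be a log file
        number plus an offset), the <code>SKETCH_*</code> codes stand
        in for the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV-&gt;rep_process_message()</a> return values, and
        <code>perm_tracker</code> is a hypothetical application
        structure.
      </p>
      <pre class="programlisting"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for an LSN: a log file number plus an offset within it. */
typedef struct {
    uint32_t file;
    uint32_t offset;
} sketch_lsn;

/* Hypothetical stand-ins for the message-processing return codes. */
typedef enum { SKETCH_OK, SKETCH_ISPERM, SKETCH_NOTPERM } sketch_ret;

/* Negative, zero, or positive as a is before, equal to, or after b. */
int
sketch_lsn_cmp(const sketch_lsn *a, const sketch_lsn *b)
{
    if (a->file != b->file)
        return a->file < b->file ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Largest LSN the client has been told is permanently stored. */
typedef struct {
    bool have_perm;
    sketch_lsn max_perm;
} perm_tracker;

/* Record the result of processing one message. */
void
perm_tracker_note(perm_tracker *t, sketch_ret ret, const sketch_lsn *ret_lsn)
{
    if (ret != SKETCH_ISPERM)
        return;     /* a NOTPERM LSN is not yet durable: nothing to record */
    if (!t->have_perm || sketch_lsn_cmp(ret_lsn, &t->max_perm) > 0) {
        t->max_perm = *ret_lsn;
        t->have_perm = true;
    }
}

/* True if lsn is at or before the largest LSN reported as permanent. */
bool
perm_tracker_is_permanent(const perm_tracker *t, const sketch_lsn *lsn)
{
    return t->have_perm && sketch_lsn_cmp(lsn, &t->max_perm) <= 0;
}
```
]]></pre>
      <p>
        An application would call <code>perm_tracker_note</code> after
        each message it processes, and consult
        <code>perm_tracker_is_permanent</code> before acting on a
        particular LSN.
      </p>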
      <p>
        An application relying on a client's ability to become a
        master and guarantee that no data has been lost will need to
        write the <span class="bold"><strong>send</strong></span> function to
        return an error whenever it cannot guarantee that the site
        that will win the next election has the record. Applications not
        requiring this level of transactional guarantee need not have
        the <span class="bold"><strong>send</strong></span> function return
        failure (unless the master's database environment has been
        configured with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>), as any information critical
        to database integrity has already been flushed to the local
        log before <span class="bold"><strong>send</strong></span> was
        called.
    </p>
      <p>
        To sum up, the only reason for the <span class="bold"><strong>send</strong></span> function
        to return failure is when the
        master database environment has been configured to not
        synchronously flush the log on transaction commit (that is,
        <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> was configured on the master), the
        <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag is specified for the message, and the
        <span class="bold"><strong>send</strong></span> function was unable
        to determine that some number of clients have received the
        current message (and all messages preceding the current
        message). How many clients need to receive the message before
        the <span class="bold"><strong>send</strong></span> function can return
        success is an application choice. That choice may depend less
        on the number of clients reporting success than on whether
        one or more geographically distributed clients have done so.
    </p>
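      <p>
        The summary rule above can be written as a single decision
        function. This is a sketch under stated assumptions rather
        than a complete <span class="bold"><strong>send</strong></span>
        implementation: <code>SKETCH_REP_PERMANENT</code> is a
        hypothetical stand-in for the <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag bit,
        <code>send_policy</code> is a hypothetical application
        structure, and the count of confirmed acknowledgements is
        assumed to come from the application's own transport layer.
      </p>
      <pre class="programlisting"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the DB_REP_PERMANENT flag bit. */
#define SKETCH_REP_PERMANENT 0x1u

/* Application-level settings that drive the decision (illustrative). */
typedef struct {
    bool master_txn_nosync;   /* master runs with DB_TXN_NOSYNC */
    unsigned acks_required;   /* how many clients must confirm receipt */
} send_policy;

/*
 * Return 0 for success and nonzero for failure.
 * Failure is reported only when all three conditions from the summary
 * hold: the master does not flush on commit, the message is permanent,
 * and too few clients are known to have received it.
 */
int
send_return_status(const send_policy *policy, unsigned flags,
    unsigned acks_confirmed)
{
    if (!(flags & SKETCH_REP_PERMANENT))
        return 0;   /* non-permanent messages never need to fail */
    if (!policy->master_txn_nosync)
        return 0;   /* the local log was already flushed on commit */
    return acks_confirmed >= policy->acks_required ? 0 : -1;
}
```
]]></pre>
      <p>
        Choosing <code>acks_required</code> remains the application's
        decision. A stricter application could replace the simple
        count with a check that specific clients have confirmed
        receipt.
      </p>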
      <p>
        If, however, the application does require on-disk durability
        on the master, the master should be configured to
        synchronously flush the log on commit. If clients are not
        configured to synchronously flush the log, that is, if a
        client is running with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> configured, then it is
        up to the application to reconfigure that client appropriately
        when it becomes a master. That is, the application must
        explicitly call <a href="../api_reference/C/envset_flags.html" class="olink">DB_ENV-&gt;set_flags()</a> to disable asynchronous log
        flushing as part of re-configuring the client as the new
        master.
    </p>
      <p>
        Of course, it is important to ensure that the replicated
        master and client environments are truly independent of each
        other. For example, it does not help matters that a client has
        acknowledged receipt of a message if both master and clients
        are on the same power supply, as the failure of the power
        supply will still potentially lose information.
    </p>
      <p>
        Configuring a Base API application to achieve the proper mix
        of performance and transactional guarantees can be complex. In
        brief, there are a few controls an application can set to
        configure the guarantees it makes: specification of
        <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the master environment, specification of
        <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the client environment, the priorities of
        different sites participating in an election, and the behavior
        of the application's <span class="bold"><strong>send</strong></span>
        function.
    </p>
      <p>
        First, it is rarely useful to write and synchronously flush
        the log when a transaction commits on a replication client. It
        may be useful where systems share resources and multiple
        systems commonly fail at the same time. By default, all
        Berkeley DB database environments, whether master or client,
        synchronously flush the log on transaction commit or prepare.
        Generally, replication masters and clients turn log flush off
        for transaction commit using the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag.
    </p>
      <p>
        Consider two systems connected by a network interface. One
        acts as the master, the other as a read-only client. If the
        master crashes, the client takes over as master, and the
        original master later rejoins the replication group.
        Both master and client are configured to not synchronously
        flush the log on transaction commit (that is, <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>
        was configured on both systems). The application's <span class="bold"><strong>send</strong></span> function never returns failure
        to the Berkeley DB library, simply forwarding messages to the
        client (perhaps over a broadcast mechanism), and always
        returning success. On the client, any <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> returns
        from the client's <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV-&gt;rep_process_message()</a> method are ignored, as well.
        This system configuration has excellent performance, but may
        lose data in some failure modes.
    </p>
      <p>
        If both the master and the client crash at once, it is
        possible to lose committed transactions, that is,
        transactional durability is not being maintained. Reliability
        can be increased by providing separate power supplies for the
        systems and placing them in separate physical
        locations.
    </p>
      <p>
        If the connection between the two machines fails (or just
        some number of messages are lost), and subsequently the master
        crashes, it is possible to lose committed transactions. Again,
        transactional durability is not being maintained. Reliability
        can be improved in a couple of ways:
    </p>
      <div class="orderedlist">
        <ol type="1">
          <li>
            <p>
                Use a reliable network protocol (for example,
                TCP instead of UDP).
            </p>
          </li>
          <li>
            <p>
                Increase the number of clients and network paths to
                make it less likely that a message will be lost. In
                this case, it is important to also make sure a client
                that did receive the message wins any subsequent
                election. If a client that did not receive the message
                wins a subsequent election, data can still be lost.
            </p>
          </li>
        </ol>
      </div>
      <p>
        Further, systems may want to guarantee message delivery to
        the client(s) (for example, to prevent a network connection
        from simply discarding messages). Some systems may want to
        ensure clients never return out-of-date information, that is,
        once a transaction commit returns success on the master, no
        client will return old information to a read-only query. Some
        of the following changes to a Base API application may be used
        to address these issues:
    </p>
      <div class="orderedlist">
        <ol type="1">
          <li>
            <p> 
                Write the application's <span class="bold"><strong>send</strong></span> function 
                to not return to Berkeley DB until one or more clients have
                acknowledged receipt of the message. The number of
                clients chosen will be dependent on the application:
                you will want to consider likely network partitions
                (ensure that a client at each physical site receives
                the message) and geographical diversity (ensure that a
                client on each coast receives the message).
            </p>
          </li>
          <li>
            <p>
                Write the client's message processing loop to not
                acknowledge receipt of the message until a call to the
                <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV-&gt;rep_process_message()</a> method has returned success. Messages
                resulting in a return of <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> from the
                <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV-&gt;rep_process_message()</a> method mean the message could not be
                flushed to the client's disk. If the client does not
                acknowledge receipt of such messages to the master
                until a subsequent call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV-&gt;rep_process_message()</a> method
                returns <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a> and the LSN returned is at
                least as large as this message's LSN, then the
                master's <span class="bold"><strong>send</strong></span>
                function will not return success to the Berkeley DB
                library. This means the thread committing the
                transaction on the master will not be allowed to
                proceed based on the transaction having committed
                until the selected set of clients have received the
                message and consider it complete.
            </p>
            <p>
                Alternatively, the client's message processing loop
                could acknowledge the message to the master, but with
                an error code indicating that the application's
                <span class="bold"><strong>send</strong></span> function
                should not return to the Berkeley DB library until a
                subsequent acknowledgement from the same client
                indicates success. 
            </p>
            <p>
                The application's send callback function, invoked by
                Berkeley DB, receives the LSN of the record being sent
                (if appropriate for that record). When the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV-&gt;rep_process_message()</a>
                method returns an indicator that a permanent record has
                been written, it also returns the maximum LSN of
                the permanent records written.
            </p>
          </li>
        </ol>
      </div>
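      <p>
        The deferred acknowledgement scheme in the second item can be
        sketched as a small pending list on the client. The code below
        is an illustration only: <code>sketch_lsn</code> stands in for
        the library's LSN type (assumed to be a log file number plus
        an offset), and <code>deferred_acks</code>,
        <code>defer_ack</code> and <code>release_acks</code> are
        hypothetical application names. A message that produced
        <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> is remembered, and it is released for
        acknowledgement once a later <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a> return reports an
        LSN at least as large as that message's LSN.
      </p>
      <pre class="programlisting"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16   /* arbitrary capacity chosen for this sketch */

/* Stand-in for an LSN: a log file number plus an offset within it. */
typedef struct {
    uint32_t file;
    uint32_t offset;
} sketch_lsn;

/* Negative, zero, or positive as a is before, equal to, or after b. */
static int
lsn_cmp(const sketch_lsn *a, const sketch_lsn *b)
{
    if (a->file != b->file)
        return a->file < b->file ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* LSNs of permanent messages whose acknowledgement is being withheld. */
typedef struct {
    sketch_lsn pending[MAX_PENDING];
    size_t count;
} deferred_acks;

/* Remember a message that could not yet be made permanent.
 * Returns 0 on success, -1 if the list is full. */
int
defer_ack(deferred_acks *d, sketch_lsn lsn)
{
    if (d->count == MAX_PENDING)
        return -1;
    d->pending[d->count++] = lsn;
    return 0;
}

/* Called with the LSN reported alongside a "now permanent" result.
 * Removes every deferred entry at or before max_perm and returns how
 * many acknowledgements are now owed to the master. */
size_t
release_acks(deferred_acks *d, sketch_lsn max_perm)
{
    size_t kept = 0, released = 0;

    for (size_t i = 0; i < d->count; i++) {
        if (lsn_cmp(&d->pending[i], &max_perm) <= 0)
            released++;                         /* safe to acknowledge */
        else
            d->pending[kept++] = d->pending[i]; /* still waiting */
    }
    d->count = kept;
    return released;
}
```
]]></pre>
      <p>
        A real message loop would send one acknowledgement to the
        master for each released entry. This sketch only counts the
        entries so that the bookkeeping stays visible.
      </p>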
      <p>
        There is one final pair of failure scenarios to consider.
        First, it is not possible to abort transactions after the
        application's <span class="bold"><strong>send</strong></span> function
        has been called, as the master may have already written the
        commit log records to disk, and so abort is no longer an
        option. Second, a related problem is that even though the
        master will attempt to flush the local log if the <span class="bold"><strong>send</strong></span> function returns failure, that
        flush may fail (for example, when the local disk is full).
        Again, the transaction cannot be aborted as one or more
        clients may have committed the transaction even if <span class="bold"><strong>send</strong></span> returns failure. A few
        applications may not be able to tolerate these unlikely
        failure modes. In that case the application may want
        to:
    </p>
      <div class="orderedlist">
        <ol type="1">
          <li>
            <p> 
                Configure the master to always do local synchronous
                commits (turning off the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>
                configuration). This will decrease performance
                significantly, of course, because one of the reasons to use
                replication is to avoid local disk writes. In this
                configuration, failure to write the local log will
                cause the transaction to abort in all cases.
            </p>
          </li>
          <li>
            <p> 
                Do not return from the application's <span class="bold"><strong>send</strong></span> function under any
                conditions, until the selected set of clients has
                acknowledged the message. Until the <span class="bold"><strong>send</strong></span> function returns to
                the Berkeley DB library, the thread committing the
                transaction on the master will wait, and so no
                application will be able to act on the knowledge that
                the transaction has committed.
            </p>
          </li>
        </ol>
      </div>
    </div>
    <div class="navfooter">
      <hr />
      <table width="100%" summary="Navigation footer">
        <tr>
          <td width="40%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
          <td width="20%" align="center">
            <a accesskey="u" href="rep.html">Up</a>
          </td>
          <td width="40%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
        </tr>
        <tr>
          <td width="40%" align="left" valign="top">Bulk transfer </td>
          <td width="20%" align="center">
            <a accesskey="h" href="index.html">Home</a>
          </td>
          <td width="40%" align="right" valign="top"> Master leases</td>
        </tr>
      </table>
    </div>
  </body>
</html>
